{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Self-querying retriever with elasticsearch and langchain\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/langchain/notebooks/langchain/self-query-retriever-examples/langchain-self-query-retriever.ipynb)\n",
    "\n",
    "This workbook demonstrates example of Elasticsearch's [Self-query retriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) to convert unstructured query into a structured query and apply structured query to a vectorstore. \n",
    "\n",
    "Before we begin, we first split the documents into chunks with `langchain` and then using [`ElasticsearchStore.from_documents`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents), we create a `vectorstore` and index data to elasticsearch.\n",
    "\n",
    "\n",
    "We will then see few examples query demonstrating full power of elasticsearch powered self-query retriever.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install packages and import modules\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!python3 -m pip install -qU lark langchain langchain-elasticsearch openai\n",
    "\n",
    "from langchain.schema import Document\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain_elasticsearch import ElasticsearchStore\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
    "from langchain.chains.query_constructor.base import AttributeInfo\n",
    "from getpass import getpass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create documents \n",
    "Next, we will create list of documents with summary of movies using [langchain Schema Document](https://api.python.langchain.com/en/latest/schema/langchain.schema.document.Document.html), containing each document's `page_content` and `metadata` .\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = [\n",
    "    Document(\n",
    "        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n",
    "        metadata={\n",
    "            \"year\": 1993,\n",
    "            \"rating\": 7.7,\n",
    "            \"genre\": \"science fiction\",\n",
    "            \"director\": \"Steven Spielberg\",\n",
    "            \"title\": \"Jurassic Park\",\n",
    "        },\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n",
    "        metadata={\n",
    "            \"year\": 2010,\n",
    "            \"director\": \"Christopher Nolan\",\n",
    "            \"rating\": 8.2,\n",
    "            \"title\": \"Inception\",\n",
    "        },\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n",
    "        metadata={\n",
    "            \"year\": 2006,\n",
    "            \"director\": \"Satoshi Kon\",\n",
    "            \"rating\": 8.6,\n",
    "            \"title\": \"Paprika\",\n",
    "        },\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n",
    "        metadata={\n",
    "            \"year\": 2019,\n",
    "            \"director\": \"Greta Gerwig\",\n",
    "            \"rating\": 8.3,\n",
    "            \"title\": \"Little Women\",\n",
    "        },\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Toys come alive and have a blast doing so\",\n",
    "        metadata={\n",
    "            \"year\": 1995,\n",
    "            \"genre\": \"animated\",\n",
    "            \"director\": \"John Lasseter\",\n",
    "            \"rating\": 8.3,\n",
    "            \"title\": \"Toy Story\",\n",
    "        },\n",
    "    ),\n",
    "    Document(\n",
    "        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n",
    "        metadata={\n",
    "            \"year\": 1979,\n",
    "            \"rating\": 9.9,\n",
    "            \"director\": \"Andrei Tarkovsky\",\n",
    "            \"genre\": \"science fiction\",\n",
    "            \"rating\": 9.9,\n",
    "            \"title\": \"Stalker\",\n",
    "        },\n",
    "    ),\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Connect to Elasticsearch\n",
    "\n",
    "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. \n",
    "\n",
    "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.\n",
    "\n",
    "\n",
    "We will use [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) to connect to our elastic cloud deployment, This would help create and index data easily.  We would also send list of documents that we created in the previous step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
    "\n",
    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
    "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
    "\n",
    "# https://platform.openai.com/api-keys\n",
    "OPENAI_API_KEY = getpass(\"OpenAI API key: \")\n",
    "\n",
    "embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
    "\n",
    "\n",
    "vectorstore = ElasticsearchStore.from_documents(\n",
    "    docs,\n",
    "    embeddings,\n",
    "    index_name=\"elasticsearch-self-query-demo\",\n",
    "    es_cloud_id=ELASTIC_CLOUD_ID,\n",
    "    es_api_key=ELASTIC_API_KEY,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup query retriever\n",
    "\n",
    "Next we will instantiate self-query retriever by providing a bit information about our document attributes and a short description about the document. \n",
    "\n",
    "We will then instantiate retriever with [SelfQueryRetriever.from_llm](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add details about metadata fields\n",
    "metadata_field_info = [\n",
    "    AttributeInfo(\n",
    "        name=\"genre\",\n",
    "        description=\"The genre of the movie. Can be either 'science fiction' or 'animated'.\",\n",
    "        type=\"string or list[string]\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"year\",\n",
    "        description=\"The year the movie was released\",\n",
    "        type=\"integer\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"director\",\n",
    "        description=\"The name of the movie director\",\n",
    "        type=\"string\",\n",
    "    ),\n",
    "    AttributeInfo(\n",
    "        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n",
    "    ),\n",
    "]\n",
    "\n",
    "document_content_description = \"Brief summary of a movie\"\n",
    "\n",
    "# Set up openAI llm with sampling temperature 0\n",
    "llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)\n",
    "\n",
    "# instantiate retriever\n",
    "retriever = SelfQueryRetriever.from_llm(\n",
    "    llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test retriever with simple query\n",
    "\n",
    "We will test the retriever with a simple query:  `What are some movies about dream`. \n",
    "\n",
    "The output shows all the relevant documents to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n",
       " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6}),\n",
       " Document(page_content='Toys come alive and have a blast doing so', metadata={'year': 1995, 'genre': 'animated'}),\n",
       " Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'year': 2019, 'director': 'Greta Gerwig', 'rating': 8.3})]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This example only specifies a relevant query\n",
    "retriever.get_relevant_documents(\"What are some movies about dream\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test retriever with simple query and filter\n",
    "\n",
    "We will now test the retriever with a query:  `Has Andrei Tarkovsky directed any science fiction movies`. \n",
    "\n",
    "This query has a filter on the metadata `genre` and  `director`. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'year': 1979, 'rating': 9.9, 'director': 'Andrei Tarkovsky', 'genre': 'science fiction'})]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.get_relevant_documents(\n",
    "    \"Has Andrei Tarkovsky directed any science fiction movies\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate retriever to filter k documents\n",
    "\n",
    "We will now instantiate retriever again to fetch k number of documents. We can do this my setting `enable_limit=True` when instantiating the retriever. \n",
    "\n",
    "We will then test retriever to filter k documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = SelfQueryRetriever.from_llm(\n",
    "    llm,\n",
    "    vectorstore,\n",
    "    document_content_description,\n",
    "    metadata_field_info,\n",
    "    enable_limit=True,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test the retriever to filter k documents\n",
    "\n",
    "We will now test the retriever with a query:  `what are two movies about dream`. \n",
    "\n",
    "The output would show exactly `2` documents. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'year': 2010, 'director': 'Christopher Nolan', 'rating': 8.2}),\n",
       " Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.get_relevant_documents(\"what are two movies about dream\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test retriever for complex queries\n",
    "\n",
    "We will try some complex queries with filters and `1 limit`.\n",
    "\n",
    "\n",
    "Query: `Show that one movie which was about dream and was released after the year 1992 but before 2007?`. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'year': 2006, 'director': 'Satoshi Kon', 'rating': 8.6})]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "retriever.get_relevant_documents(\n",
    "    \"Show that one movie which was about dream and was released after the year 1992 but before 2007?\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.11.4 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.3"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
